NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Accurate prediction of nucleic acid binding proteins using protein language model

https://doi.org/10.1093/bioadv/vbaf008

Wu, Siwen; Xu, Jinbo; Guo, Jun-tao; Bateman, ed., Alex (January 2025, Bioinformatics Advances)

Abstract MotivationNucleic acid binding proteins (NABPs) play critical roles in various and essential biological processes. Many machine learning-based methods have been developed to predict different types of NABPs. However, most of these studies have limited applications in predicting the types of NABPs for any given protein with unknown functions, due to several factors such as dataset construction, prediction scope and features used for training and testing. In addition, single-stranded DNA binding proteins (DBP) (SSBs) have not been extensively investigated for identifying novel SSBs from proteins with unknown functions. ResultsTo improve prediction accuracy of different types of NABPs for any given protein, we developed hierarchical and multi-class models with machine learning-based methods and a feature extracted from protein language model ESM2. Our results show that by combining the feature from ESM2 and machine learning methods, we can achieve high prediction accuracy up to 95% for each stage in the hierarchical approach, and 85% for overall prediction accuracy from the multi-class approach. More importantly, besides the much improved prediction of other types of NABPs, the models can be used to accurately predict single-stranded DBPs, which is underexplored. Availability and implementationThe datasets and code can be found at https://figshare.com/projects/Prediction_of_nucleic_acid_binding_proteins_using_protein_language_model/211555.
more » « less
Improved prediction of DNA and RNA binding proteins with deep learning models

https://doi.org/10.1093/bib/bbae285

Wu, Siwen; Guo, Jun-tao (June 2024, Briefings in Bioinformatics)

Abstract Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play important roles in essential biological processes. To facilitate functional annotation and accurate prediction of different types of NABPs, many machine learning-based computational approaches have been developed. However, the datasets used for training and testing as well as the prediction scopes in these studies have limited their applications. In this paper, we developed new strategies to overcome these limitations by generating more accurate and robust datasets and developing deep learning-based methods including both hierarchical and multi-class approaches to predict the types of NABPs for any given protein. The deep learning models employ two layers of convolutional neural network and one layer of long short-term memory. Our approaches outperform existing DBP and RBP predictors with a balanced prediction between DBPs and RBPs, and are more practically useful in identifying novel NABPs. The multi-class approach greatly improves the prediction accuracy of DBPs and RBPs, especially for the DBPs with ~12% improvement. Moreover, we explored the prediction accuracy of single-stranded DNA binding proteins and their effect on the overall prediction accuracy of NABP predictions.
more » « less
Prevalent use and evolution of exonic regulatory sequences in the human genome

https://doi.org/10.1002/ntls.20220058

Chen, Jing; Ni, Pengyu; Wu, Siwen; Niu, Meng; Guo, Jun‐tao; Su, Zhengsheng (March 2023, Natural Sciences)

Abstract It has long been known that exons can serve ascis‐regulatory sequences, such as enhancers. However, the prevalence of such dual‐use of exons and how they evolve remain elusive. Based on our recently predicted, highly accurate large sets ofcis‐regulatory module candidates (CRMCs) and non‐CRMCs in the human genome, we find that exonic transcription factor binding sites (TFBSs) occupy at least a third of the total exon lengths, and 96.7% of genes have exonic TFBSs. Both A/T and C/G in exonic TFBSs are more likely under evolutionary constraints than those in non‐CRMC exons. Exonic TFBSs in codons tend to encode loops rather than more critical helices and strands in protein structures, while exonic TFBSs in untranslated regions (UTRs) tend to avoid positions where known UTR‐related functions are located. Moreover, active exonic TFBSs tend to be in close physical proximity to distal promoters whose genes have elevated transcription levels. These results suggest that exonic TFBSs might be more prevalent than originally thought and likely in dual‐use. We proposed a parsimonious model that well explains the observed evolutionary behaviors of exonic TFBS as well as how a stretch of codons evolve into a TFBS. Key pointsThere are more exonic regulatory sequences in the human genome than originally thought.Exonic transcription factor binding sites are more likely under negative selection or positive selection than counterpart nonregulatory sequences.Exonic transcription factor binding sites tend to be located in genome sequences that encode less critical loops in protein structures, or in less critical parts in 5′ and 3′ untranslated regions.
more » « less

Search for: All records